Tidying Data

Tidying data is required for many reasons, including these:

  • The names of the variables are different from what you require
  • Missing data
  • Values are not in units that you require
  • The sampling period of the records is not what you need
  • Variables are categorical and you need quantitative values
  • There is noise in the data
  • Information is of an incorrect type
  • Data is organized around incorrect axes
  • Data is at the wrong level of normalization
  • Data is duplicated

Moving away from a list of problems that need to be addressed, there are several characteristics of data that can be considered good, tidy, and ready for analysis, as follows (a short sketch after the list illustrates the idea):

  • Each variable is in one column
  • Each observation of the variable is in a different row
  • There should be one table for each kind of variable
  • If there are multiple tables, they should be relatable
  • Qualitative and categorical variables have mappings to values useful for analysis
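
As a quick illustration of the first two characteristics, a wide table with one column per year can be melted so that each variable occupies one column and each observation one row. This is only a minimal sketch with hypothetical column names, not part of the examples that follow:

# a hypothetical "untidy" table: one column per year
wide = pd.DataFrame({'country': ['CA', 'US'],
                     '2013': [100, 200],
                     '2014': [110, 210]})
# melt it so that year becomes a variable in its own
# column and each row is a single observation
tidy = pd.melt(wide, id_vars=['country'],
               var_name='year', value_name='value')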

Setting up the notebook


In [2]:
# import pandas, numpy and datetime
import numpy as np
import pandas as pd
import datetime

# set some pandas options for controlling output
pd.set_option('display.notebook_repr_html', False)
pd.set_option('display.max_columns',10)
pd.set_option('display.max_rows',10)

Working with Missing Data

Data is "missing" in pandas when it has a value of NaN (also seen as np.nan - the form from NumPy). The NaN value represents that in a particular Series that there is not a value specified for the particular index level.

In pandas, there are a number of reasons why a value can be NaN:

  • A join of two sets of data does not have matched values
  • Data that you retrieved from an external source is incomplete
  • A value is not known at a given point in time and will be filled in later
  • There is a data collection error retrieving a value, but the event must still be recorded in the index
  • Reindexing of data has resulted in an index that does not have a value (see the sketch after this list)
  • The shape of the data has changed, and there are new rows or columns
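
For example, reindexing with a label that did not exist in the original index introduces a NaN at that label; a minimal sketch:

# reindexing with a new label 'd' introduces a NaN at 'd'
s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s.reindex(['a', 'b', 'c', 'd'])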

In [3]:
# create a DataFrame with 5 rows and 3 columns
df = pd.DataFrame(np.arange(0, 15).reshape(5, 3),
                  index=['a', 'b', 'c', 'd', 'e'],
                  columns=['col1', 'col2', 'col3'])
df


Out[3]:
   col1  col2  col3
a     0     1     2
b     3     4     5
c     6     7     8
d     9    10    11
e    12    13    14

In [4]:
# add some columns and rows to the DataFrame
# column c4 with NaN values
df['c4'] = np.nan
# row 'f' with 15 through 18
df.loc['f'] = np.arange(15,19)
# row 'g' with all NaN
df.loc['g'] = np.nan
# column c5 with NaN's
df['c5'] = np.nan
# change value in col 'c4' row 'a'
df.loc['a', 'c4'] = 20
df


Out[4]:
   col1  col2  col3    c4  c5
a   0.0   1.0   2.0  20.0 NaN
b   3.0   4.0   5.0   NaN NaN
c   6.0   7.0   8.0   NaN NaN
d   9.0  10.0  11.0   NaN NaN
e  12.0  13.0  14.0   NaN NaN
f  15.0  16.0  17.0  18.0 NaN
g   NaN   NaN   NaN   NaN NaN

This DataFrame object exhibits the following characteristics that will support most of the examples that follow:

  • One row consisting only of NaN values
  • One column consisting only of NaN values
  • Several rows and columns consisting of both numeric and NaN values

Determining NaN values in Series and DataFrame objects


In [5]:
# which items are NaN?
df.isnull()


Out[5]:
    col1   col2   col3     c4    c5
a  False  False  False  False  True
b  False  False  False   True  True
c  False  False  False   True  True
d  False  False  False   True  True
e  False  False  False   True  True
f  False  False  False  False  True
g   True   True   True   True  True

In [6]:
# count the number of NaN values in each column
df.isnull().sum()


Out[6]:
col1    1
col2    1
col3    1
c4      5
c5      7
dtype: int64

In [7]:
# total count of NaN values
df.isnull().sum().sum()


Out[7]:
15

In [8]:
# number of non-NaN values in each column
df.count()


Out[8]:
col1    6
col2    6
col3    6
c4      2
c5      0
dtype: int64

In [9]:
# and this can be used to count the NaN values
(len(df)-df.count()).sum()


Out[9]:
15

In [10]:
# which items are not null
df.notnull()


Out[10]:
    col1   col2   col3     c4     c5
a   True   True   True   True  False
b   True   True   True  False  False
c   True   True   True  False  False
d   True   True   True  False  False
e   True   True   True  False  False
f   True   True   True   True  False
g  False  False  False  False  False

Selecting out or dropping missing data


In [11]:
# select the non-NaN items in column c4
df.c4[df.c4.notnull()]


Out[11]:
a    20.0
f    18.0
Name: c4, dtype: float64

In [12]:
# .dropna() will also return non-NaN values
# this gets all non-NaN items in column c4
df.c4.dropna()


Out[12]:
a    20.0
f    18.0
Name: c4, dtype: float64

In [13]:
# .dropna() returns a copy with the values dropped
# the source DataFrame / column is not changed
df.c4


Out[13]:
a    20.0
b     NaN
c     NaN
d     NaN
e     NaN
f    18.0
g     NaN
Name: c4, dtype: float64

When applied to a DataFrame object, dropna() will drop all rows from the DataFrame that have at least one NaN value. If you want to drop only rows where all values are NaN, you can use the how="all" parameter.


In [14]:
# on a dataframe this will drop entire rows
# where there is at least one NaN
# in this case, all the rows
df.dropna()


Out[14]:
Empty DataFrame
Columns: [col1, col2, col3, c4, c5]
Index: []

In [15]:
# using how='all', only rows that have all values
# as NaN will be dropped
df.dropna(how='all')


Out[15]:
   col1  col2  col3    c4  c5
a   0.0   1.0   2.0  20.0 NaN
b   3.0   4.0   5.0   NaN NaN
c   6.0   7.0   8.0   NaN NaN
d   9.0  10.0  11.0   NaN NaN
e  12.0  13.0  14.0   NaN NaN
f  15.0  16.0  17.0  18.0 NaN

In [16]:
# flip to drop columns instead of rows
df.dropna(how='all',axis=1) # c5 column will be dropped


Out[16]:
   col1  col2  col3    c4
a   0.0   1.0   2.0  20.0
b   3.0   4.0   5.0   NaN
c   6.0   7.0   8.0   NaN
d   9.0  10.0  11.0   NaN
e  12.0  13.0  14.0   NaN
f  15.0  16.0  17.0  18.0
g   NaN   NaN   NaN   NaN

In [19]:
# make a copy of df
df2 = df.copy()
# replace two NaN cells with values
df2.loc['g', 'col1'] = 0
df2.loc['g', 'col3'] = 0
df2


Out[19]:
   col1  col2  col3    c4  c5
a   0.0   1.0   2.0  20.0 NaN
b   3.0   4.0   5.0   NaN NaN
c   6.0   7.0   8.0   NaN NaN
d   9.0  10.0  11.0   NaN NaN
e  12.0  13.0  14.0   NaN NaN
f  15.0  16.0  17.0  18.0 NaN
g   0.0   NaN   0.0   NaN NaN

In [20]:
# now drop columns with any NaN values
df2.dropna(how='any',axis=1)


Out[20]:
   col1  col3
a   0.0   2.0
b   3.0   5.0
c   6.0   8.0
d   9.0  11.0
e  12.0  14.0
f  15.0  17.0
g   0.0   0.0

In [21]:
# keep only columns with at least 5 non-NaN values
df.dropna(thresh=5,axis=1)


Out[21]:
   col1  col2  col3
a   0.0   1.0   2.0
b   3.0   4.0   5.0
c   6.0   7.0   8.0
d   9.0  10.0  11.0
e  12.0  13.0  14.0
f  15.0  16.0  17.0
g   NaN   NaN   NaN

Note that the .dropna() method returns a copy of the DataFrame object, and the data is dropped from that copy. If you want to drop the data in the actual DataFrame, use the inplace=True parameter.
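
A minimal sketch of the in-place variant, applied to a copy here so that df is preserved for the remaining examples:

# drop the all-NaN row from a copy, in place
df3 = df.copy()
df3.dropna(how='all', inplace=True)  # modifies df3 and returns None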

How pandas handles NaN values in mathematical operations

NaN values are handled differently in pandas than in NumPy. NumPy functions, when encountering a NaN value, will return NaN. pandas functions, in contrast, will typically ignore the NaN values and continue processing as though those values were not part of the Series object.

More specifically the way that pandas handles NaN values is as follows:

  • Summing of data treats NaN as 0
  • If all values are NaN, the result is NaN
  • Methods like .cumsum() and .cumprod() ignore NaN values but preserve them in resulting arrays
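
If you need the pandas-style behavior from plain NumPy, the NaN-aware variants such as np.nansum and np.nanmean skip NaN values in much the same way:

# NaN-aware NumPy functions skip NaN values like pandas does
a = np.array([1, 2, np.nan, 3])
np.nanmean(a)   # 2.0, the NaN is skipped
a.mean()        # nan, the NaN propagates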

In [22]:
# create a NumPy array with one NaN value
a = np.array([1,2,np.nan,3])
# create a Series from the array
s = pd.Series(a)
# the mean of each is different
a.mean(), s.mean()


Out[22]:
(nan, 2.0)

In [23]:
# demonstrate sum,mean and cumsum handling of NaN
# get one column
s = df.c4
s.sum()   # NaN values treated as 0


Out[23]:
38.0

In [24]:
s.mean() # NaN values are excluded from the calculation


Out[24]:
19.0

In [25]:
# NaN values are skipped in the cumsum, but preserved in the result series
s.cumsum()


Out[25]:
a    20.0
b     NaN
c     NaN
d     NaN
e     NaN
f    38.0
g     NaN
Name: c4, dtype: float64

In [26]:
# in arithmetic, a NaN value will result in NaN
df.c4 + 1


Out[26]:
a    21.0
b     NaN
c     NaN
d     NaN
e     NaN
f    19.0
g     NaN
Name: c4, dtype: float64

Filling in missing data

If you prefer to replace the NaN values with a specific value, instead of having them propagated or flat out ignored, you can use the .fillna() method. The following code fills the NaN values with 0:


In [27]:
# return a new DataFrame with NaN values filled with 0
filled = df.fillna(0)
filled


Out[27]:
   col1  col2  col3    c4   c5
a   0.0   1.0   2.0  20.0  0.0
b   3.0   4.0   5.0   0.0  0.0
c   6.0   7.0   8.0   0.0  0.0
d   9.0  10.0  11.0   0.0  0.0
e  12.0  13.0  14.0   0.0  0.0
f  15.0  16.0  17.0  18.0  0.0
g   0.0   0.0   0.0   0.0  0.0

In [28]:
# having replaced NaN with 0 can make
# operations such as mean have different results
filled.mean()


Out[28]:
col1    6.428571
col2    7.285714
col3    8.142857
c4      5.428571
c5      0.000000
dtype: float64

It is also possible to limit the number of values that will be filled using the limit parameter. When filling with a value, pandas will fill at most limit NaN values in each column.


In [29]:
# only fills the first two NaN values in each column with 0
df.fillna(0,limit=2)


Out[29]:
   col1  col2  col3    c4   c5
a   0.0   1.0   2.0  20.0  0.0
b   3.0   4.0   5.0   0.0  0.0
c   6.0   7.0   8.0   0.0  NaN
d   9.0  10.0  11.0   NaN  NaN
e  12.0  13.0  14.0   NaN  NaN
f  15.0  16.0  17.0  18.0  NaN
g   0.0   0.0   0.0   NaN  NaN
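
The limit parameter can also be combined with the forward and backward fills shown next; when a fill method is used, limit caps the number of consecutive NaN values filled in each gap. A short sketch using the c4 column:

# forward fill, but fill at most one consecutive NaN per gap
df.c4.fillna(method='ffill', limit=1)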

Forward and Backward Filling of Missing Values

Gaps in data can be filled by propagating the non-NaN values forward or backward along a Series. To demonstrate this, the following example will "fill forward" the c4 column of DataFrame:


In [30]:
# extract the c4 column and fill NaNs forward
df.c4.fillna(method="ffill")


Out[30]:
a    20.0
b    20.0
c    20.0
d    20.0
e    20.0
f    18.0
g    18.0
Name: c4, dtype: float64

In [31]:
# perform a backward fill
df.c4.fillna(method="bfill")


Out[31]:
a    20.0
b    18.0
c    18.0
d    18.0
e    18.0
f    18.0
g     NaN
Name: c4, dtype: float64

You can also use the convenient .ffill() and .bfill() methods as shorthand for these operations.
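
A minimal sketch of the equivalent calls:

# equivalent to fillna(method="ffill") and fillna(method="bfill")
df.c4.ffill()
df.c4.bfill()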

Filling Using index labels


In [33]:
# create a new series of values to be
# used to fill NaN values where the index label matches
fill_values = pd.Series([100,101,102], index=['a','e','g'])
fill_values


Out[33]:
a    100
e    101
g    102
dtype: int64

In [34]:
# using c4, fill using fill_values
# a, e and g will be filled with matching values
df.c4.fillna(fill_values)


Out[34]:
a     20.0
b      NaN
c      NaN
d      NaN
e    101.0
f     18.0
g    102.0
Name: c4, dtype: float64

In [35]:
# fill NaN values in each column with the
# mean of the values in that column
df.fillna(df.mean())


Out[35]:
   col1  col2  col3    c4  c5
a   0.0   1.0   2.0  20.0 NaN
b   3.0   4.0   5.0  19.0 NaN
c   6.0   7.0   8.0  19.0 NaN
d   9.0  10.0  11.0  19.0 NaN
e  12.0  13.0  14.0  19.0 NaN
f  15.0  16.0  17.0  18.0 NaN
g   7.5   8.5   9.5  19.0 NaN

Interpolation of missing values

Both DataFrame and Series have an .interpolate() method that will, by default, perform a linear interpolation of missing values:


In [36]:
# linear interpolate the NaN values from 1 through 2
s = pd.Series([1,np.nan,np.nan,np.nan,2])
s.interpolate()


Out[36]:
0    1.00
1    1.25
2    1.50
3    1.75
4    2.00
dtype: float64

The interpolated values are calculated by taking the last value before and the first value after any sequence of NaN values and dividing the difference evenly across the gap. In this case, 1.0 and 2.0 are the surrounding values, giving an increment of (2.0-1.0)/(5-1)=0.25, which is then added cumulatively through all the NaN values.

The interpolation method also has the ability to specify a specific method of interpolation, one of the common methods is to use time-based interpolation.


In [37]:
# create a time series, but missing one date in the Series
ts = pd.Series([1, np.nan, 3],
               index=[datetime.datetime(2014, 1, 1),
                      datetime.datetime(2014, 2, 1),
                      datetime.datetime(2014, 4, 1)])
ts


Out[37]:
2014-01-01    1.0
2014-02-01    NaN
2014-04-01    3.0
dtype: float64

In [38]:
ts.interpolate()


Out[38]:
2014-01-01    1.0
2014-02-01    2.0
2014-04-01    3.0
dtype: float64

The important thing to note is that the series is missing an entry for 2014-03-01. The default linear interpolation ignores the index and treats the three points as equally spaced, so 2014-02-01 is simply assigned the midpoint value, 2.0. Time-based interpolation instead weights the result by the actual elapsed time between the known values (a quick hand check follows the output below).


In [39]:
# this accounts for the fact that we don't have
# an entry for 2014-03-01
ts.interpolate(method="time")


Out[39]:
2014-01-01    1.000000
2014-02-01    1.688889
2014-04-01    3.000000
dtype: float64
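
The result matches a by-hand calculation, since 2014-02-01 lies 31 days into the 90-day span between the two known values:

# hand check of the time-weighted interpolation:
# 31 days from 2014-01-01 to 2014-02-01,
# 90 days from 2014-01-01 to 2014-04-01
1 + (3 - 1) * 31.0 / 90   # 1.6888...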

Interpolation can also be specified to calculate values relative to the index values when using numeric index labels.


In [40]:
# a series to demonstrate index label based interpolation
s = pd.Series([0,np.nan,100], index=[0,1,10])
s


Out[40]:
0       0.0
1       NaN
10    100.0
dtype: float64

In [41]:
# linear interpolate
s.interpolate()


Out[41]:
0       0.0
1      50.0
10    100.0
dtype: float64

In [42]:
# interpolate based upon the values in the index
s.interpolate(method="values")


Out[42]:
0       0.0
1      10.0
10    100.0
dtype: float64

Now the value calculated for the NaN is interpolated using relative positioning based upon the labels in the index. The NaN value has a label of 1, which is one tenth of the way between 0 and 10, so the interpolated value is 0 + (100-0) * 1/10, or 10.

Handling Duplicate Data

Often it is considered best to err on the side of having duplicates instead of missing data, especially if the data is considered to be idempotent. However, duplicate data increases the size of the dataset, and if the processing is not idempotent, it would not be appropriate to process the duplicates.


In [43]:
# a DataFrame with lots of duplicate data
data = pd.DataFrame({'a':['x'] * 3 + ['y'] * 4,'b':[1,1,2,3,3,4,4]})
data


Out[43]:
   a  b
0  x  1
1  x  1
2  x  2
3  y  3
4  y  3
5  y  4
6  y  4

In [44]:
# reports which rows are duplicates, based upon
# whether the data in all columns was seen before
data.duplicated()


Out[44]:
0    False
1     True
2    False
3    False
4     True
5    False
6     True
dtype: bool

In [45]:
# drop duplicate rows retaining first row of the duplicates
data.drop_duplicates()


Out[45]:
   a  b
0  x  1
2  x  2
3  y  3
5  y  4

The default operation is to keep the first row of the duplicates. If you want to keep the last row of each set of duplicates, you can use the keep='last' parameter.


In [49]:
# drop duplicate rows, keeping the last instance of any data
data.drop_duplicates(keep="last")


Out[49]:
   a  b
1  x  1
2  x  2
4  y  3
6  y  4

In [50]:
# add a column c with values 0..6
# this makes .duplicated() report no duplicate rows
data['c'] = range(7)
data.duplicated()


Out[50]:
0    False
1    False
2    False
3    False
4    False
5    False
6    False
dtype: bool

In [51]:
# but if we specify duplicates to be dropped in columns a & b
# they will be dropped
data.drop_duplicates(['a','b'])


Out[51]:
   a  b  c
0  x  1  0
2  x  2  2
3  y  3  3
5  y  4  5

Transforming Data

Transformation is required for the following reasons:

  • Values are not in the correct units
  • Values are qualitative and need to be converted to appropriate numeric values
  • Extraneous data that either wastes memory and processing time or can affect results simply by being included

To address these situations we can take one or more of the following actions:

  • Map values to other values using a table lookup process
  • Explicitly replace certain values with other values
  • Apply methods to transform the values based on an algorithm
  • Simply remove extraneous columns and rows

Mapping

pandas provides a generic ability to map values using a lookup table, via the .map() method. This method performs the mapping by matching the values of the outer Series with the index labels of the inner Series, returning a new Series with the index labels of the outer Series but the values from the inner Series:


In [52]:
# create two Series objects to demonstrate mapping
x = pd.Series({"one":1,"two":2,"three":3})
y = pd.Series({1:"a",2:"b",3:"c"})
x


Out[52]:
one      1
three    3
two      2
dtype: int64

In [53]:
y


Out[53]:
1    a
2    b
3    c
dtype: object

In [54]:
# map values in x to values in y
x.map(y)


Out[54]:
one      a
three    c
two      b
dtype: object

In [55]:
# three in x will not align / map to a value in y
x = pd.Series({"one":1,"two":2,"three":3})
y = pd.Series({1:"a",2:"b"})
x.map(y)


Out[55]:
one        a
three    NaN
two        b
dtype: object

Replacing Values


In [56]:
# create a Series to demonstrate replace
s = pd.Series([0.,1.,2.,3.,4.])
s


Out[56]:
0    0.0
1    1.0
2    2.0
3    3.0
4    4.0
dtype: float64

In [57]:
# replace all occurrences of the value 2 with the value 5
s.replace(2,5)


Out[57]:
0    0.0
1    1.0
2    5.0
3    3.0
4    4.0
dtype: float64

In [58]:
# replace all items with new values
s.replace([0,1,2,3,4],[4,3,2,1,0])


Out[58]:
0    4.0
1    3.0
2    2.0
3    1.0
4    0.0
dtype: float64

In [59]:
# replace using entries in a dictionary
s.replace({0:10,1:100})


Out[59]:
0     10.0
1    100.0
2      2.0
3      3.0
4      4.0
dtype: float64

If using .replace() on a DataFrame, it is possible to specify a different value to match in each column. This is performed by passing a Python dictionary to the .replace() method, where the keys of the dictionary are the names of the columns where replacement is to occur and the values of the dictionary are the values that you want to replace. The second parameter to the method is the value that will be substituted wherever a match is found.


In [60]:
# DataFrame with two columns
df = pd.DataFrame({'a':[0,1,2,3,4],'b':[5,6,7,8,9]})
df


Out[60]:
   a  b
0  0  5
1  1  6
2  2  7
3  3  8
4  4  9

In [61]:
# replace 1 in column a and 8 in column b with 100
df.replace({'a':1,'b':8}, 100)


Out[61]:
     a    b
0    0    5
1  100    6
2    2    7
3    3  100
4    4    9

Replacing specific values in each of the columns is very convenient, as it provides a shorthand for what otherwise would require coding a loop through all the columns.
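
For comparison, a rough sketch of the explicit loop that the dictionary form saves you from writing (using the same replacement specification as above):

# the long-hand equivalent of df.replace({'a': 1, 'b': 8}, 100)
result = df.copy()
for col, to_replace in {'a': 1, 'b': 8}.items():
    result[col] = result[col].replace(to_replace, 100)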


In [64]:
# demonstrate replacement with pad method
# set first item to 10, to have a distinct replacement value
s[0] = 10
s


Out[64]:
0    10.0
1     1.0
2     2.0
3     3.0
4     4.0
dtype: float64

In [65]:
# replace the values 1, 2 and 3, filling each from the
# most recent value prior to it (10)
s.replace([1,2,3],method='pad')


Out[65]:
0    10.0
1    10.0
2    10.0
3    10.0
4     4.0
dtype: float64

Applying Functions to Transform Data

pandas provides the ability to apply functions to individual items, entire columns, or entire rows, providing great flexibility in transformation.

Functions can be applied using the conveniently named .apply() method which, given a Python function, will iteratively call the function, passing in each value from a Series, or each Series representing a column or row of a DataFrame.


In [66]:
# demonstrate applying a function to every item of a series
s = pd.Series(np.arange(0,5))
s.apply(lambda x: x * 2)


Out[66]:
0    0
1    2
2    4
3    6
4    8
dtype: int64

In [67]:
# demonstrate applying a sum on each column
df = pd.DataFrame(np.arange(12).reshape(4,3),columns=['a','b','c'])
df


Out[67]:
   a   b   c
0  0   1   2
1  3   4   5
2  6   7   8
3  9  10  11

In [68]:
# calculate the sum of the items in each column
df.apply(lambda col: col.sum())


Out[68]:
a    18
b    22
c    26
dtype: int64

In [69]:
# calculate the sum of items in each row
df.apply(lambda row: row.sum(),axis=1)


Out[69]:
0     3
1    12
2    21
3    30
dtype: int64

A common practice is to take the result of an apply operation and add it as a new column of the DataFrame. This is convenient as you can accumulate the results of one or more successive calculations in the DataFrame.


In [70]:
# create a new column 'interim' with a * b
df['interim'] = df.apply(lambda r: r.a * r.b,axis=1)
df


Out[70]:
   a   b   c  interim
0  0   1   2        0
1  3   4   5       12
2  6   7   8       42
3  9  10  11       90

In [72]:
# and now a 'result' column with 'interim' + 'c'
df['result'] = df.apply(lambda r: r.interim + r.c, axis=1)
df


Out[72]:
   a   b   c  interim  result
0  0   1   2        0       2
1  3   4   5       12      17
2  6   7   8       42      50
3  9  10  11       90     101

In [73]:
# replace column a with the sum of columns a,b and c
df.a = df.a + df.b + df.c
df


Out[73]:
    a   b   c  interim  result
0   3   1   2        0       2
1  12   4   5       12      17
2  21   7   8       42      50
3  30  10  11       90     101

An important point to note is that a pandas DataFrame is not a spreadsheet where cells are assigned formulas that are recalculated when the cells they reference change. If you want that to happen, you will need to execute the formulas yourself whenever the dependent data changes, as sketched below. On the flip side, this is more efficient than a spreadsheet, as every little change does not cause a cascade of operations to occur.
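
A minimal sketch of that pattern, using a hypothetical recalc helper that re-runs the dependent calculations from the examples above whenever an input changes:

# re-run the derived-column formulas explicitly after a change
def recalc(frame):
    frame['interim'] = frame.a * frame.b
    frame['result'] = frame.interim + frame.c
    return frame

df.loc[0, 'a'] = 100  # change an input value...
df = recalc(df)       # ...then recompute the derived columns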

The .apply() method will always apply the provided function to all of the items, rows, or columns. If you want to apply the function to only a subset of these, first perform a Boolean selection to filter out the items you do not want to process.


In [75]:
# create a 3 X 5 dataframe
df = pd.DataFrame(np.arange(0,15).reshape(3,5))
df.loc[1,2]= np.nan
df


Out[75]:
    0   1     2   3   4
0   0   1   2.0   3   4
1   5   6   NaN   8   9
2  10  11  12.0  13  14

In [76]:
# demonstrate applying a function to only the rows
# having no NaN values
df.dropna().apply(lambda x:x.sum(), axis=1)


Out[76]:
0    10.0
2    60.0
dtype: float64

The .apply() method is always passed an entire row or column. If you want to apply a function to each individual item in the DataFrame, .applymap() is the method to use.


In [77]:
# use applymap to format all items of the DataFrame
df.applymap(lambda x: '%.2f' % x)


Out[77]:
       0      1      2      3      4
0   0.00   1.00   2.00   3.00   4.00
1   5.00   6.00    nan   8.00   9.00
2  10.00  11.00  12.00  13.00  14.00